Skip to content

fix: improve prometheus schema usage collection and implement sampling#2323

Merged
StarpTech merged 10 commits intomainfrom
dustin/eng-8469-add-documentation-and-inform-customer
Nov 11, 2025
Merged

fix: improve prometheus schema usage collection and implement sampling#2323
StarpTech merged 10 commits intomainfrom
dustin/eng-8469-add-documentation-and-inform-customer

Conversation

@StarpTech
Copy link
Copy Markdown
Contributor

@StarpTech StarpTech commented Nov 11, 2025

Summary by CodeRabbit

  • New Features
    • Probabilistic sampling for Prometheus schema field-usage metrics via a SampleRate (0.0–1.0) to emit representative subsets and reduce metric volume.
  • Configuration
    • Added sample_rate setting to telemetry.metrics.prometheus.schema_usage with schema, env var support, and default (1.0).
  • Tests
    • Added/extended tests validating 0%, 10%, and 100% sampling, metric counts, and label exposure.

Checklist

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 11, 2025

Walkthrough

Adds probabilistic sampling for Prometheus schema field usage metrics via a new SampleRate (0.0–1.0). Threads SampleRate through config, schema, metric initialization, and runtime; operation-level emission now aggregates per-field counts and performs probabilistic sampling. Also renames promSchemaUsageIncludeOperationSha → promSchemaUsageIncludeOpSha and updates tests to cover sampling behavior.

Changes

Cohort / File(s) Summary
Configuration types & schema
router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/metric/config.go
Add SampleRate float64 to PrometheusSchemaFieldUsage with yaml/env tags and JSON schema entry (range 0.0–1.0, default 1.0).
Config fixtures & test data
router/pkg/config/fixtures/full.yaml, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json
Update fixtures and test JSON to include sample_rate / SampleRate and adjust include_operation_sha in full fixture.
Config loading tests
router/pkg/config/config_test.go
Update tests to load and assert SampleRate from file and environment (new env var mapping/default).
Test environment plumbing
router-tests/testenv/testenv.go
Add SampleRate to testenv PrometheusSchemaFieldUsage and propagate it into router configuration used by tests.
Prometheus sampling tests
router-tests/prometheus_improved_test.go
Add and modify tests to validate sampling behavior at SampleRate values (1.0, 0.1, 0.0); exercise requests, read metrics, and assert labels and aggregated counts.
Core metrics — operation-level logic
router/core/operation_metrics.go
Add usageKey type, add promSchemaUsageSampleRate field, implement shouldSampleOperation() using math/rand, aggregate per-field usage counts in Finish(), emit counted schema-field usage measures, and rename include flag to promSchemaUsageIncludeOpSha.
Metrics plumbing & initialization
router/core/router_metrics.go, router/core/graph_server.go, router/core/router.go
Thread SampleRate through metric config and struct initializers; add promSchemaUsageSampleRate assignments and replace promSchemaUsageIncludeOperationSha with promSchemaUsageIncludeOpSha.
Tests & minor adjustments
router-tests/*, router/pkg/config/*
Update tests, defaults, and testenv to reflect new SampleRate; minor import/formatting tweaks in tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Review probabilistic sampling logic in router/core/operation_metrics.go for correct probability math and appropriate randomness source.
  • Verify aggregation/emission in Finish() produces correct counts and labels.
  • Confirm consistent rename from promSchemaUsageIncludeOperationShapromSchemaUsageIncludeOpSha across code and tests and that SampleRate flows through config → runtime → tests.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: improve prometheus schema usage collection and implement sampling' clearly describes the main changes: it adds sampling functionality and improves Prometheus schema usage collection across multiple files.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 11, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-b89f4bea6166fd2041b3f983bb5b58749f2ab5bd-nonroot

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
router/pkg/config/config.schema.json (1)

1249-1254: Sampling schema looks good; add examples and consider naming consistency (nit).

The field is well-bounded and defaults are sensible. Two optional nits:

  • Add examples to guide users.
  • Tracing uses "sampling_rate" while this uses "sample_rate"; keeping them aligned could reduce confusion (only if backward-compat allows).

Apply this minimal doc enhancement:

           "sample_rate": {
             "type": "number",
             "default": 1.0,
             "minimum": 0.0,
             "maximum": 1.0,
-            "description": "Sample rate for schema field usage metrics (0.0 to 1.0). Uses probabilistic random sampling, ensuring all operations get ~X% statistical coverage with uniform distribution. Only recommended for large scale deployments. Default is 1.0 (100% sampling)."
+            "description": "Sample rate for schema field usage metrics (0.0 to 1.0). Uses probabilistic random sampling, ensuring all operations get ~X% statistical coverage with uniform distribution. Only recommended for large scale deployments. Default is 1.0 (100% sampling).",
+            "examples": [1.0, 0.5, 0.1, 0.01]
           }
router-tests/testenv/testenv.go (1)

279-283: Avoid zero-value default changing behavior; make SampleRate a pointer with defaulting.

float64 zero-value (0.0) can unintentionally disable sampling in tests even though schema default is 1.0. Use a pointer to distinguish “unset” from an explicit 0.0.

 type PrometheusSchemaFieldUsage struct {
   Enabled             bool
   IncludeOperationSha bool
-  SampleRate          float64
+  // If nil, defaults to 1.0 in configureRouter to mirror schema defaults.
+  SampleRate          *float64
 }
router/core/router_metrics.go (1)

30-33: Good addition and rename; suggest clamping sample rate defensively.

The new fields and propagation look correct. Add a small clamp in NewRouterMetrics so out-of-range values from non-schema callers (e.g., tests) don’t leak through. Based on learnings.

 func NewRouterMetrics(cfg *routerMetricsConfig) RouterMetrics {
-  return &routerMetrics{
+  // Clamp sample rate to [0,1] defensively (test harness may bypass schema validation).
+  sr := cfg.promSchemaUsageSampleRate
+  if sr < 0 {
+    sr = 0
+  } else if sr > 1 {
+    sr = 1
+  }
+  return &routerMetrics{
     metrics:             cfg.metrics,
     gqlMetricsExporter:  cfg.gqlMetricsExporter,
     routerConfigVersion: cfg.routerConfigVersion,
     logger:              cfg.logger,
     exportEnabled:       cfg.exportEnabled,
-    promSchemaUsageEnabled:      cfg.promSchemaUsageEnabled,
-    promSchemaUsageIncludeOpSha: cfg.promSchemaUsageIncludeOpSha,
-    promSchemaUsageSampleRate:   cfg.promSchemaUsageSampleRate,
+    promSchemaUsageEnabled:      cfg.promSchemaUsageEnabled,
+    promSchemaUsageIncludeOpSha: cfg.promSchemaUsageIncludeOpSha,
+    promSchemaUsageSampleRate:   sr,
   }
 }

Also applies to: 42-45, 55-58

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f28c5d and 76ee54a.

📒 Files selected for processing (13)
  • router-tests/prometheus_improved_test.go (2 hunks)
  • router-tests/testenv/testenv.go (2 hunks)
  • router/core/graph_server.go (1 hunks)
  • router/core/operation_metrics.go (5 hunks)
  • router/core/router.go (1 hunks)
  • router/core/router_metrics.go (4 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (2 hunks)
  • router/pkg/config/fixtures/full.yaml (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/metric/config.go (1 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
  • router-tests/testenv/testenv.go
  • router/core/router_metrics.go
  • router/pkg/config/fixtures/full.yaml
  • router-tests/prometheus_improved_test.go
  • router/pkg/config/config_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_full.json
  • router/core/graph_server.go
  • router/core/router_metrics.go
  • router-tests/prometheus_improved_test.go
  • router/pkg/config/config.schema.json
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/core/router.go
  • router/pkg/metric/config.go
  • router/core/router_metrics.go
  • router/pkg/config/config.schema.json
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.

Applied to files:

  • router/pkg/config/config.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/graph_server.go
  • router/core/router_metrics.go
  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/graph_server.go
  • router/core/router_metrics.go
  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/core/graph_server.go
  • router/core/router_metrics.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router/core/graph_server.go
  • router/core/router_metrics.go
  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/graph_server.go
  • router/core/router_metrics.go
  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router-tests/prometheus_improved_test.go
🧬 Code graph analysis (6)
router-tests/testenv/testenv.go (2)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsage (41-48)
router/core/router.go (1)
router/pkg/config/config.go (2)
  • Metrics (130-135)
  • Prometheus (99-113)
router/core/graph_server.go (1)
router/pkg/config/config.go (1)
  • Prometheus (99-113)
router/core/operation_metrics.go (1)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router-tests/prometheus_improved_test.go (3)
router-tests/testenv/testenv.go (6)
  • Run (105-122)
  • Config (285-342)
  • MetricOptions (263-277)
  • PrometheusSchemaFieldUsage (279-283)
  • Environment (1733-1769)
  • GraphQLRequest (1909-1917)
router/pkg/metric/config.go (2)
  • Config (111-135)
  • PrometheusSchemaFieldUsage (41-48)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
router/pkg/config/config_test.go (1)
router/pkg/config/config.go (5)
  • LoadConfig (1113-1225)
  • Config (1007-1082)
  • Telemetry (150-156)
  • Metrics (130-135)
  • Prometheus (99-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (16)
router/pkg/config/testdata/config_defaults.json (1)

65-66: LGTM! Default SampleRate configuration looks correct.

The addition of SampleRate: 1 (100% sampling) as the default is appropriate and maintains backward compatibility by not reducing metrics collection by default.

Note: The value is represented as integer 1 rather than float 1.0. While JSON handles both representations correctly, this is slightly inconsistent with the Go envDefault:"1.0" in config.go line 118. This is acceptable but worth noting for consistency.

router/pkg/config/fixtures/full.yaml (1)

179-180: LGTM! Helpful documentation for sampling configuration.

The addition of sample_rate: 1.0 with the inline comment explaining supported values is excellent documentation. The change of include_operation_sha from false to true appears intentional to demonstrate both features together in the "full" configuration fixture.

router/core/router.go (1)

2315-2315: LGTM! Proper configuration wiring.

The SampleRate field is correctly populated from the configuration source, following the same pattern as the other fields in the PrometheusSchemaFieldUsage struct.

router/pkg/config/config.go (1)

116-118: LGTM! Configuration field properly defined.

The SampleRate field is correctly configured with appropriate defaults and tags:

  • envDefault:"1.0" ensures 100% sampling by default, maintaining backward compatibility
  • Environment variable PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE follows naming conventions
  • YAML key sample_rate uses snake_case consistently with other fields
router/pkg/config/testdata/config_full.json (1)

95-96: LGTM! Test configuration updated correctly.

The addition of SampleRate: 1 and enabling IncludeOperationSha: true in the "full" configuration test data appropriately demonstrates both schema field usage features working together.

router/pkg/metric/config.go (1)

44-47: LGTM! Field properly validated through JSON schema.

The SampleRate field is well-documented with clear comments, and range validation (0.0 ≤ SampleRate ≤ 1.0) is enforced in the JSON schema (config.schema.json, line 1249-1253) with explicit minimum and maximum constraints. The default value of 1.0 (100% sampling) is appropriately documented.

router-tests/testenv/testenv.go (2)

1518-1522: Wiring LGTM.

Propagation of SampleRate into Prometheus config is correct.


1518-1522: The review comment is incorrect and should be disregarded.

The SampleRate field is float64 (non-pointer), not *float64. The proposed diff checks if testConfig.MetricOptions.PrometheusSchemaFieldUsage.SampleRate == nil, which would not compile—you cannot check nil on a non-pointer type.

The current code is already correct. The config.schema.json specifies a default value of 1.0 for sample_rate, so schema validation ensures the value is always set to 1.0 when unspecified in configuration. The current pass-through works as intended.

Likely an incorrect or invalid review comment.

router/core/router_metrics.go (1)

74-77: Propagation into OperationMetricsOptions looks correct.

Sampler flags and rate are passed to OperationMetrics as expected.

router/core/graph_server.go (1)

922-924: LGTM! Clean initialization of Prometheus schema usage configuration.

The field initialization correctly sources values from the configuration, including the newly added SampleRate field and the renamed IncludeOpSha field (previously IncludeOperationSha).

router/pkg/config/config_test.go (1)

652-693: LGTM! Comprehensive test coverage for SampleRate configuration.

The test cases properly verify:

  • Loading SampleRate from YAML configuration (0.5)
  • Default value when not specified (1.0 = 100% sampling)
  • Environment variable override (PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE)
router-tests/prometheus_improved_test.go (2)

359-403: LGTM! Deterministic test with 100% sampling.

This test correctly verifies that 100% sampling tracks all requests with accurate counts. The assertions are precise and will not suffer from probabilistic failures.


405-439: LGTM! Deterministic test with 0% sampling.

This test correctly verifies that 0% sampling records no metrics. The logic is sound and deterministic.

router/core/operation_metrics.go (3)

47-50: LGTM! Clean struct for field usage aggregation.

The usageKey struct appropriately captures the composite key (fieldName, parentType) needed to aggregate schema field usage metrics.


92-129: LGTM! Correct implementation of sampling and aggregation logic.

The refactored Finish() method correctly:

  1. Gates metric emission on shouldSampleOperation()
  2. Aggregates field usage counts using the usageKey map
  3. Emits aggregated counts via MeasureSchemaFieldUsage
  4. Conditionally includes operation SHA256 based on configuration

The aggregation approach efficiently reduces metric cardinality by emitting one metric per unique field rather than one per field occurrence.


148-164: LGTM! Proper initialization of new configuration fields.

The constructor correctly initializes the Prometheus schema usage fields from the provided options, including the new promSchemaUsageSampleRate field.

Comment thread router-tests/prometheus_improved_test.go
Comment thread router/core/operation_metrics.go Outdated
Comment thread router-tests/prometheus_improved_test.go Outdated
Comment thread router-tests/prometheus_improved_test.go Outdated
Comment thread router/core/operation_metrics.go Outdated
Comment thread router/core/operation_metrics.go Outdated
StarpTech and others added 3 commits November 11, 2025 15:25
Co-authored-by: Jens Neuse <jens.neuse@gmx.de>
Co-authored-by: Jens Neuse <jens.neuse@gmx.de>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
router/core/operation_metrics.go (1)

94-96: Consider adding a clarifying comment about early return scope.

The early return here exits the entire Finish method when sampling fails. While this is safe (all other metrics have been recorded by this point), a brief comment could improve clarity for future maintainers.

Optional addition:

 	if !m.shouldSampleOperation() {
+		// Skip Prometheus schema usage metrics for unsampled operations
+		// (all other request metrics have already been recorded above)
 		return
 	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76ee54a and 807ddbd.

📒 Files selected for processing (2)
  • router-tests/prometheus_improved_test.go (7 hunks)
  • router/core/operation_metrics.go (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/prometheus_improved_test.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/core/operation_metrics.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router/core/operation_metrics.go
🧬 Code graph analysis (1)
router/core/operation_metrics.go (1)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (7)
router/core/operation_metrics.go (7)

5-5: Good use of math/rand/v2 for auto-seeded random sampling.

The import of math/rand/v2 correctly addresses the past review concern about unseeded random number generation. This version is auto-seeded and provides non-deterministic sampling across instances without requiring explicit initialization.


47-50: Clean struct design for usage aggregation.

The usageKey struct appropriately encapsulates the compound key for tracking field usage by field name and parent type, enabling efficient counting in the map.


42-45: Field additions look good.

The renamed promSchemaUsageIncludeOpSha and new promSchemaUsageSampleRate field are appropriately typed and follow consistent naming conventions.


108-130: Counting and aggregation logic looks solid.

The usage counting mechanism correctly:

  • Aggregates field usage by (fieldName, parentType) compound key
  • Emits metrics with proper counts instead of individual events
  • Combines operation and field attributes appropriately

This reduces metric cardinality and improves efficiency.


144-146: Options struct updated correctly.

The OperationMetricsOptions fields align properly with the OperationMetrics struct, maintaining naming consistency and appropriate types.


164-167: Constructor initialization is correct.

All Prometheus schema usage fields are properly initialized from the options parameter.


170-190: Sampling logic is well-implemented and documented.

The shouldSampleOperation method correctly:

  • Handles boundary conditions (sample rates of 0.0, 1.0)
  • Implements probabilistic sampling with uniform distribution
  • Provides clear documentation explaining the approach

The use of rand.Float64() < m.promSchemaUsageSampleRate ensures proper statistical coverage across all operations.

Comment thread router/core/operation_metrics.go
Comment thread router/core/operation_metrics.go
Comment thread router-tests/prometheus_improved_test.go Outdated
Comment thread router/pkg/config/config.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 807ddbd and b3583e7.

📒 Files selected for processing (2)
  • router-tests/prometheus_improved_test.go (7 hunks)
  • router/core/operation_metrics.go (5 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/prometheus_improved_test.go
  • router/core/operation_metrics.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-09-17T13:51:50.823Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2216
File: router-tests/authentication_test.go:847-847
Timestamp: 2025-09-17T13:51:50.823Z
Learning: Go 1.22 introduced the ability to range over integers, making `for range n` syntax valid for iterating n times in loops.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
  • router/core/operation_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router-tests/prometheus_improved_test.go
🧬 Code graph analysis (2)
router-tests/prometheus_improved_test.go (3)
router-tests/testenv/testenv.go (6)
  • Run (105-122)
  • Config (285-342)
  • MetricOptions (263-277)
  • PrometheusSchemaFieldUsage (279-283)
  • Environment (1733-1769)
  • GraphQLRequest (1909-1917)
router/pkg/metric/config.go (2)
  • Config (111-135)
  • PrometheusSchemaFieldUsage (41-48)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
router/core/operation_metrics.go (1)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (9)
router-tests/prometheus_improved_test.go (3)

40-41: LGTM: SampleRate additions to existing tests.

Setting SampleRate: 1.0 ensures 100% sampling for deterministic test behavior, which is appropriate for tests that verify specific metric counts and labels.

Also applies to: 133-134, 209-209, 247-247, 291-291


364-408: LGTM: 100% sampling test is correct.

This test correctly validates that 100% sampling tracks all requests with accurate counter values. The use of for range 10 is valid in Go 1.22+.


410-444: LGTM: 0% sampling test is correct.

This test correctly validates that 0% sampling produces no metrics, with appropriate defensive checks for nil results.

router/core/operation_metrics.go (6)

42-50: LGTM: Field updates and usageKey struct.

The renamed field promSchemaUsageIncludeOpSha (shorter name) and new promSchemaUsageSampleRate field align with the PR objectives. The usageKey struct provides a clean way to aggregate field usage counts by field name and parent type.


92-132: LGTM: Sampling and aggregation logic in Finish method.

The implementation correctly:

  • Performs early return when sampling fails (addresses past feedback)
  • Includes bounds check for field.Path length (addresses past critical issue)
  • Aggregates field usage counts using the usageKey struct
  • Emits metrics with accurate counts using int64(count)

The logic ensures that when a request is sampled, all field usage within that request is aggregated and emitted with proper counts.


170-190: LGTM: Probabilistic sampling implementation.

The shouldSampleOperation() method correctly implements probabilistic sampling with:

  • Proper edge case handling (0% and 100% rates)
  • Non-deterministic random sampling using math/rand/v2
  • Clear documentation explaining the choice of probabilistic over hash-based sampling

The approach ensures uniform statistical coverage across all operations.


144-146: LGTM: Updated OperationMetricsOptions fields.

The renamed PrometheusSchemaUsageIncludeOpSha field and new PrometheusSchemaUsageSampleRate field are consistent with the struct field changes and PR objectives.


164-166: LGTM: Initialization of sampling fields.

The initialization correctly assigns all three Prometheus schema usage configuration fields from the options struct.


5-5: Go version compatibility confirmed.

The project specifies Go 1.25 in both router/go.mod and router-tests/go.mod, which exceeds the Go 1.22+ requirement for math/rand/v2. The import correctly addresses the deterministic sampling concern from past reviews with an auto-seeded implementation.

Comment thread router-tests/prometheus_improved_test.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3583e7 and e138c48.

📒 Files selected for processing (1)
  • router-tests/prometheus_improved_test.go (7 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-09-17T13:51:50.823Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2216
File: router-tests/authentication_test.go:847-847
Timestamp: 2025-09-17T13:51:50.823Z
Learning: Go 1.22 introduced the ability to range over integers, making `for range n` syntax valid for iterating n times in loops.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router-tests/prometheus_improved_test.go
🧬 Code graph analysis (1)
router-tests/prometheus_improved_test.go (3)
router-tests/testenv/testenv.go (5)
  • Run (105-122)
  • Config (285-342)
  • MetricOptions (263-277)
  • PrometheusSchemaFieldUsage (279-283)
  • Environment (1733-1769)
router/pkg/metric/config.go (2)
  • Config (111-135)
  • PrometheusSchemaFieldUsage (41-48)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)

Comment thread router-tests/prometheus_improved_test.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router-tests/prometheus_improved_test.go (1)

318-370: Well-designed sampling test with appropriate parameters.

The test correctly addresses flakiness concerns from past reviews by using 10% sampling (rather than 1%), which gives a negligible probability of zero samples (≈0.00265%). The assertions properly validate sampling behavior by checking total field counts rather than just metric existence, which would fail to detect a regression that ignores the sample rate.

For even greater reliability, consider increasing to 1000 requests as suggested in past reviews:

-			// Make 100 requests
-			for i := 0; i < 100; i++ {
+			// Make 1000 requests for maximum sampling distribution stability
+			for i := 0; i < 1000; i++ {
 				res := xEnv.MakeGraphQLRequestOK(testenv.GraphQLRequest{
 					Query: `query myQuery { employee(id: 1) { id } }`,
 				})
 				require.JSONEq(t, `{"data":{"employee":{"id":1}}}`, res.Body)
 			}

 			mf, err := promRegistry.Gather()
 			require.NoError(t, err)

 			schemaUsage := findMetricFamilyByName(mf, SchemaFieldUsageMetricName)
 			assert.NotNil(t, schemaUsage)

 			schemaUsageMetrics := schemaUsage.GetMetric()

 			require.Greater(t, len(schemaUsageMetrics), 0, "At least 1 request should be sampled")

-			// With 10% sampling and 100 requests, each sampled request increments two field counters (`employee` and `id`).
-			// 100% sampling would produce 200 total field counts (100 requests * 2 fields), so a reduced total confirms sampling worked.
+			// With 10% sampling and 1000 requests, each sampled request increments two field counters (`employee` and `id`).
+			// 100% sampling would produce 2000 total field counts (1000 requests * 2 fields), so a reduced total confirms sampling worked.
 			totalFieldCounts := 0.0
 			for _, m := range schemaUsageMetrics {
 				counter := m.GetCounter()
 				require.NotNil(t, counter)
 				totalFieldCounts += counter.GetValue()
 			}

 			require.Greater(t, totalFieldCounts, 0.0, "At least one sampled field is expected with a 10% sample rate")
-			require.Less(t, totalFieldCounts, 200.0, "Sampling should record fewer than 100% of requests (200 total field counts)")
+			require.Less(t, totalFieldCounts, 2000.0, "Sampling should record fewer than 100% of requests (2000 total field counts)")

 			// Verify that the sampled metrics have correct structure
 			for _, m := range schemaUsageMetrics {
 				assertLabelValue(t, m.Label, otel.WgOperationName, "myQuery")
 				assertLabelValue(t, m.Label, otel.WgOperationType, "query")
 			}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e138c48 and bc0e073.

📒 Files selected for processing (1)
  • router-tests/prometheus_improved_test.go (7 hunks)
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-09-17T13:51:50.823Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2216
File: router-tests/authentication_test.go:847-847
Timestamp: 2025-09-17T13:51:50.823Z
Learning: Go 1.22 introduced the ability to range over integers, making `for range n` syntax valid for iterating n times in loops.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router-tests/prometheus_improved_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router-tests/prometheus_improved_test.go
🧬 Code graph analysis (1)
router-tests/prometheus_improved_test.go (4)
router-tests/testenv/testenv.go (6)
  • Run (105-122)
  • Config (285-342)
  • MetricOptions (263-277)
  • PrometheusSchemaFieldUsage (279-283)
  • Environment (1733-1769)
  • GraphQLRequest (1909-1917)
router/pkg/config/config.go (2)
  • Config (1007-1082)
  • PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (2)
  • Config (111-135)
  • PrometheusSchemaFieldUsage (41-48)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
router-tests/prometheus_improved_test.go (4)

7-8: LGTM!

The import reordering and alias are appropriate for the new usage in the helper function.


40-41: LGTM!

Explicitly setting SampleRate: 1.0 ensures backward compatibility and makes the test behavior clear now that sampling is configurable.


372-416: LGTM!

The test correctly validates that 100% sampling tracks all requests. The use of for range 10 is valid (Go 1.22+), and the assertions properly verify that each field is counted exactly 10 times.


418-452: LGTM!

The test correctly validates that 0% sampling disables metric tracking. The defensive nil check on line 448 is good practice, as the metric family may not exist when no metrics are recorded.

@StarpTech StarpTech merged commit 1a0cb7b into main Nov 11, 2025
29 checks passed
@StarpTech StarpTech deleted the dustin/eng-8469-add-documentation-and-inform-customer branch November 11, 2025 15:37
Copy link
Copy Markdown
Contributor

@ysmolski ysmolski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I was too late...

type PrometheusSchemaFieldUsage struct {
Enabled bool
IncludeOperationSha bool
// SampleRate controls the percentage of requests to sample for schema field usage metrics (0.0 to 1.0).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the "percentage of requests", but "the fraction of requests" to be precise.

Suggested change
// SampleRate controls the percentage of requests to sample for schema field usage metrics (0.0 to 1.0).
// SampleRate controls the fraction of requests to sample for schema field usage metrics (0.0 to 1.0).

Comment on lines +178 to +179
// Note: Uses non-deterministic random sampling rather than hash-based sampling because
// sequential request IDs produce clustered hash values that break deterministic sampling.
Copy link
Copy Markdown
Contributor

@ysmolski ysmolski Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Kind of contradictory statement. "Uses non-deterministic" conflicts with "clustered hash values that break deterministic sampling".

Suggested change
// Note: Uses non-deterministic random sampling rather than hash-based sampling because
// sequential request IDs produce clustered hash values that break deterministic sampling.
// Note: Uses non-deterministic random sampling rather than hash-based sampling because
// sequential request IDs produce clustered hash values that are not distributed uniformly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen this. Thanks for the suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants